An Alternative Softmax Operator for Reinforcement Learning
Authors
Abstract
A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum-utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion, ensuring convergent behavior in learning and planning. We introduce a variant of the SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice.
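The abstract does not spell out the operator's functional form. As an illustration only, the sketch below uses a log-sum-exp style operator of the kind studied in this line of work (the name mellowmax, the parameter omega, and both helper functions are assumptions, not taken from the text); it interpolates between the mean and the maximum of its inputs, and a Boltzmann policy with an explicit inverse-temperature parameter is shown alongside it. The paper's SARSA variant would choose that temperature per state; that step is omitted here.

```python
import numpy as np

def mellowmax(values, omega=5.0):
    """Log-sum-exp style softmax operator (assumed form, not from the abstract).

    Interpolates between the mean (omega -> 0) and the max (omega -> inf)
    of `values`.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    c = values.max()  # subtract the max for numerical stability
    return c + np.log(np.exp(omega * (values - c)).sum() / n) / omega

def boltzmann_policy(q_values, beta):
    """Boltzmann (softmax) action distribution with inverse temperature beta."""
    q_values = np.asarray(q_values, dtype=float)
    z = beta * (q_values - q_values.max())
    p = np.exp(z)
    return p / p.sum()

# Toy usage: the operator's value sits between the mean and the max of the Q-values.
q = [0.1, 0.5, 0.4]
print(np.mean(q), mellowmax(q, omega=5.0), np.max(q))
print(boltzmann_policy(q, beta=2.0))
```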
Similar papers
Deep Learning Policy Quantization
We introduce a novel actor-critic approach for deep reinforcement learning based on learning vector quantization. We replace the softmax operator of the policy with a more general and more flexible operator that is similar to the robust soft learning vector quantization algorithm. We compare our approach to the default A3C architecture on three Atari 2600 games and a simplistic...
Cold-Start Reinforcement Learning with Softmax Policy Gradient
Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax value function that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity ...
On the Relationship between Learning Capability and the Boltzmann-Formula
This paper treats the combined use of reinforcement learning and simulated annealing. Most simulated annealing methods suggest heuristic temperature bounds as the basis of annealing. Here, a theoretically established approach tailored to reinforcement learning with a Softmax action selection policy is presented. An application example of agent-based routing is also illustrated...
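As a rough illustration of the combination described above, the following sketch pairs Softmax (Boltzmann) action selection with a simple exponential cooling schedule; the schedule, its constants, and the Q-values are assumptions for illustration, not the theoretically derived temperature bounds of the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(q_values, temperature):
    """Sample an action from a Boltzmann (Softmax) distribution over Q-values."""
    q = np.asarray(q_values, dtype=float)
    z = (q - q.max()) / temperature  # shift for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(q), p=p))

# Illustrative exponential cooling: high temperature early (exploration),
# low temperature late (exploitation). The constants are assumptions.
temperature, decay, t_min = 1.0, 0.995, 0.01
for step in range(1000):
    action = boltzmann_action([0.2, 0.1, 0.3], temperature)
    temperature = max(t_min, temperature * decay)
```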
The Exploration vs Exploitation Trade-Off in Bandit Problems: An Empirical Study
We compare well-known action selection policies used in reinforcement learning, such as ε-greedy and softmax, with lesser-known ones such as the Gittins index and the knowledge gradient on bandit problems. The latter two perform very well in comparison. Moreover, the knowledge gradient can be generalized beyond bandit problems.
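For concreteness, here is a minimal Bernoulli-bandit sketch of the two best-known baselines mentioned above, ε-greedy and Softmax action selection, with incremental value estimates; the arm means, ε, and temperature are illustrative assumptions, and the Gittins index and knowledge gradient are not implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]  # illustrative Bernoulli arm means

def epsilon_greedy(q_est, epsilon=0.1):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_est)))
    return int(np.argmax(q_est))

def softmax_select(q_est, temperature=0.1):
    """Sample an arm from a Boltzmann distribution over the value estimates."""
    z = (np.asarray(q_est) - max(q_est)) / temperature
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(q_est), p=p))

# Short bandit run with incremental mean estimates, reporting average reward.
for select in (epsilon_greedy, softmax_select):
    q_est, counts, total = np.zeros(3), np.zeros(3), 0.0
    for t in range(2000):
        a = select(q_est)
        r = float(rng.random() < true_means[a])
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]
        total += r
    print(select.__name__, total / 2000)
```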
The Introduction of a Heuristic Mutation Operator to Strengthen the Discovery Component of XCS
The extended classifier system (XCS) tries to solve learning problems online by producing a set of rules (classifiers). XCS is a rather complex combination of a genetic algorithm and reinforcement learning: it uses the genetic algorithm to discover promising rules and reinforcement learning to value them. Among the important factors in the performance of XCS is the possibility to...